1 Introduction


It would be hard for me to find a better audience with which to discuss a 
particular question which interests me, and which I hope will interest you as 
well. 1 

The title of this talk is “Your Turing Machine or Mine?”. What I am allud- 
ing to with this title is the universality of a universal Turing machine, and at 
the same time, to the fact that there are many different universal Turing ma- 
chines with somewhat different properties. Universal Turing machines are both 
universal and individual, in different senses. A universal Turing machine is one 
that can emulate (or imitate) any other Turing machine, and thus in a sense can 
undertake to compute any of a very large class of computable functions. But 
there are an indefinitely large number of universal Turing machines that can be 
defined. 

What does this have to do with the learning of natural languages by human 
beings? A lot—and not very much! My purpose in this paper is to outline a way 
of understanding and evaluating linguistic research that seems to leave psycho- 
logical reality completely out of the picture. The goals of linguistics and those 
of psycholinguistics are, I think, very close and yet not identical. The linguist 
wants to discover how language can be learned, while the psycholinguist wants
to discover how language is learned. Obviously anything learned by the one is 
going to be very relevant to the other, even if the two ask different questions. 
Lurking behind what I will suggest below is the belief that it is very hard for 
the field of linguistics to figure out how to interpret and integrate findings that 
derive from psychological studies: those findings can inspire linguistic research, 
but psychological studies are already, by their nature, so embedded within par- 
ticular views of what language is and how it works that it is extremely difficult, 
and perhaps impossible, to take them (the studies) as some sort of boundary 
conditions for linguistic analysis. My suggestion is that we develop and make 
explicit a strictly linguistic understanding of how language works, one which can 
be utilized by psycholinguists in turn. So it may be that the direction in which 


1 This is a draft of a paper to be presented at a workshop on machine learning and cognitive
science in language acquisition, at University College London on 21-22 June 2007, organized
by Alex Clark and Nick Chater. I am indebted to many people for helping me along the path that
this paper describes, particularly to what I have learned from a close reading of de Marcken
1996 [5], and to conversations with Aris Xanthos, Mark Johnson, Antonio Galves, and Jorma
Rissanen. I am reasonably certain that as of the last formulation which each of these persons
heard, none of them was in agreement with what I was suggesting, but who knows, one or
more of them might possibly be in agreement with what I have written here. This paper
expands some brief observations in [6].


I am struggling to go will turn some of you off. If so, I apologize in advance. 
I will admit that I am being pushed in this direction by the force of the ideas, 
almost against my own will. But there you are. 

The risk that this presentation runs is that you might think that it takes 
much too seriously a view of linguistics developed fifty years ago that even 
its original proponent (in this case, Noam Chomsky) does not take seriously
anymore. I will try to explain why I think this classical generative view was an
excellent beginning, but that, as its insufficiencies became apparent, the directions in
which linguists modified it were not the most appropriate ones, given what we
were learning (and continue to learn) about the abstract nature of learning.

I began my research career as a convinced Chomskian, but you must bear in 
mind that what Chomsky was selling at that point was quite different from what 
he is selling today. What I would like to explain in the first part of this talk 
is what I think of as the “classical” theory of generative grammar, Chomsky’s 
first theory and the one that he laid out in The Logical Structure of Linguistic 
Theory. 

In typical Anglo-Saxon style, I will start by explaining to you exactly what I 
hope to say today. First I will describe the problem of linguistic theory as it was 
perceived by linguists in the mid 20th century, which is to say, by such people 
as Zellig Harris and Noam Chomsky. The problem was the same as the one 
that people like Ray Solomonoff were trying to deal with: how can we justify a 
formal grammar that makes an infinite number of predictions on the basis of a 
finite set of data? 

I will sketch Chomsky’s solution, and then compare it to Ray Solomonoff’s 
solution—worked out at the same moment and in the same city, in fact. This 
leads us to a discussion of MDL analysis, and how it is used currently to under- 
stand grammar induction. 

I will then take what looks like a big leap of faith, and suggest that the 
algorithmic complexity of a grammar is what we as linguists should care about, 
not complexity on the basis of a hypothetical or conjectured system of species- 
specific universal grammar. (People ask me, Why should we take algorithmic 
complexity as a serious candidate for linguists’ complexity? My reply is, why 
should we think that a linguist can improve on the mathematical notion of 
algorithmic complexity?) 

And now the problem, it seems to me, comes down to the fact that deciding 
on exactly the right measure of algorithmic complexity for a grammar may de- 
pend on precisely which Universal Turing Machine we employ. As I understand 
him, Rissanen’s view on this question is that this is a small price to pay, but it
is an unavoidable price to pay. There is, he believes, always an arbitrariness or 
a matter of “convention” (that is, of human choice) in deciding which universal 
Turing machine we employ to determine algorithmic complexity. I will offer an 
alternative, based on the notion that we linguists form only a finite set and that 
we communicate and, to some extent, we compete (or at least our theories do), 
and we can establish in a practical sense a good enough answer to the question, 
Your Turing Machine or Mine? The answer is: we, as a community of scientists, 
can develop an explicit measure that we all have reason to trust. The solution 
will turn out to be this: every linguist has the right to put forward a universal 
Turing machine, or UTM. He has a vested interest only in putting forward a 
Turing machine which favors his linguistic analyses, but he is not required to 
do so. When someone offers a new universal Turing machine to the community, 


he must also offer a suite of emulation software, allowing his UTM to emulate
every other UTM that has already been accepted. Finally, the community
as a whole chooses to use, as the scientific benchmark, that Universal Turing 
machine for which the actual total bit length of the emulation programs is the 
smallest. That is what I will suggest, and I actually mean this as a practical, 
not a totally abstract and theoretical proposal. We have some code writing to 
do, ASAP. 
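The selection rule just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not an implementation of the proposal: the machine names and all bit lengths are invented, and `emulator_bits[a][b]` is a hypothetical table giving the length in bits of the program that lets machine `a` emulate machine `b`.

```python
# Hypothetical emulator lengths in bits: emulator_bits[a][b] is the length of
# the program that lets UTM a emulate UTM b. All names and numbers are invented.
emulator_bits = {
    "utm_A": {"utm_B": 1200, "utm_C": 900},
    "utm_B": {"utm_A": 700, "utm_C": 1100},
    "utm_C": {"utm_A": 1500, "utm_B": 1600},
}


def total_emulation_cost(utm: str) -> int:
    """Total bits this UTM needs in order to emulate every other accepted UTM."""
    return sum(emulator_bits[utm].values())


def choose_benchmark(table: dict) -> str:
    """Pick the UTM whose suite of emulation programs is shortest in total."""
    return min(table, key=lambda utm: sum(table[utm].values()))


benchmark = choose_benchmark(emulator_bits)
print(benchmark, total_emulation_cost(benchmark))
```

With these made-up numbers the totals are 2100, 1800 and 3100 bits, so the community would adopt `utm_B` as its benchmark until a newly proposed machine changes the table.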


2 Chomsky and classical generative grammar 


Classical generative grammar is the model proposed by Noam Chomsky in his 
Logical Structure of Linguistic Theory [1], and sketched in Syntactic Structures 
[2]. In a famous passage, Chomsky suggested comparing three models of what 
linguistic theory might be which are successively weaker, in the sense that each 
successive model does less than the preceding one. In the first model, linguists 
would develop a formal device that would produce a grammar, given a natural 
language corpus. In the second model, the formal device would not generate 
the grammar, but it would check to ensure that in some fashion or other the
grammar was (or could be) properly and appropriately deduced or induced from 
the data. In the third model, linguists would develop a formal model that neither 
produced nor verified grammars, given data, but rather, the device would take a 
set of observations, and a set of two (or more) grammars, and determine which 
one was the more (or most) appropriate for the corpus. Chomsky suggests that 
the third, the weakest, is good enough, and he expresses doubt that either of 
the first two is feasible in practice.
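One way to see how the three models differ is to write down the type of device each one asks for. The sketch below is an illustration only: the names `DiscoveryProcedure`, `VerificationProcedure` and `EvaluationProcedure` are labels chosen here, grammars are represented as plain strings, and `shortest_grammar` is a toy stand-in that assumes every candidate already generates the corpus.

```python
from typing import Callable, Sequence

Corpus = Sequence[str]   # observed sentences
Grammar = str            # stand-in: a grammar written out as a string

# Model 1: produce a grammar, given a corpus.
DiscoveryProcedure = Callable[[Corpus], Grammar]
# Model 2: check that a proposed grammar is warranted by the corpus.
VerificationProcedure = Callable[[Corpus, Grammar], bool]
# Model 3: given a corpus and several grammars, pick the most appropriate one.
EvaluationProcedure = Callable[[Corpus, Sequence[Grammar]], Grammar]


def shortest_grammar(corpus: Corpus, candidates: Sequence[Grammar]) -> Grammar:
    """Toy evaluation procedure: prefer the shortest candidate (first one on ties).

    The corpus is ignored here; the candidates are assumed to generate it already.
    """
    return min(candidates, key=len)


choice: Grammar = shortest_grammar(["a b"], ["S -> a b | b a", "S -> a b"])
```

Each successive type asks less of the device: the first must invent a grammar, the second only accept or reject one, and the third only compare candidates that someone else supplies.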
Chomsky's three views of linguistic theory


Chomsky believed that we could and should account for grammar selection
on the basis of the formal simplicity of the grammar, and that the specifics of
how that simplicity should be defined were a matter to be decided by studying
actual languages in detail. In the last stage of classical generative grammar,
Chomsky went so far as to propose that the specifics of how grammar complexity
should be defined are part of our genetic endowment.

His argument against Views #1 and #2 was weak, so weak as to perhaps 
not merit being called an argument: it was that he thought that neither could 
be successfully accomplished, based in part on the fact that he had tried for 
several years, and in addition he felt hemmed in by the kind of grammatical 
theory that appeared to be necessary to give such perspectives a try. But 
regardless of whether it was a strong argument, it was convincing. 2


2 Chomsky’s view of scientific knowledge was deeply influenced by Nelson Goodman’s view,
which belongs to a long braid of thought about the nature of science with deep
roots; without going back too far, we can trace those roots back to Ernst Mach, who emphasized
the role that simplicity of data description plays in science, and to the Vienna Circle,


Chomsky proposed the following methodology, in two steps. 

First, linguists should develop formal grammars for individual languages, 
and treat them as scientific theories, whose predictions could be tested against 
native speaker intuitions among other things. Eventually, in a fashion parallel 
to the way in which a theory of physics or chemistry is tested and improved, 
a consensus will develop as to the form and shape of the right grammar, for a 
certain number of human languages. 3

But at the same time, linguists will be formulating their grammars with an 
eye to what aspects of their fully specified grammars are universal and what 
aspects are language-particular. And here is where the special insight of gener- 
ative grammar came in: Chomsky proposed that it should be possible to specify 
a higher-level language in which grammars are written which would have the 
special property that the right grammar was also the shortest grammar that 
was compatible with any reasonable sized sample of data from any natural lan- 
guage. If we could do that, Chomsky would say that we had achieved our final 
goal, explanatory adequacy. 

There were two other assumptions made (though occasionally questioned, 
as we will see) by this early Chomskian approach. The first [SMALL ROLE OF 
DATA] was that the role played by the data from the language was relatively 
minor, and could be used in a relatively simple manner to throw out or eliminate 
from consideration any grammar which failed to generate the data. A particular 
grammar g might be right for German, but it is not right for English, because 


which began as a group of scholars interested in developing Mach’s perspectives on knowledge 
and science. And all of these scholars viewed themselves, quite correctly, as trying to cope 
with the problem of induction as it was identified by David Hume in the 18th century:
how can anyone be sure of a generalization (especially one with infinite consequences), given 
only a finite number of observations? 

While it’s not our business here today to go through the work of this tradition in any 
detail, it is nonetheless useful to understand that the problem can be sliced into two parts, 
the symbolic and the pre-symbolic. The pre-symbolic problem is: how do we know the 
right way to go from observations that we make to statements which represent or encode the 
observations? Who determines what that process is, and how can we even try to make explicit 
what the right connections are? Fortunately for us, I have nothing to say about this extremely 
difficult problem, and will leave it utterly in peace. 

The second, symbolic problem matters to us, though. The second problem is based on the 
notion that even if it’s possible to make fully explicit the ways in which we must translate 
statements from one system of symbolic representation to another, the two systems may 
disagree with respect to what is an appropriate generalization to draw from the same set of 
observations. 

This was the problem that Chomsky proposed to solve, and he proposed to solve it by 
removing it from philosophy or epistemology, and moving it into science: the choice of the 
language in which generalizations are expressed cannot be decided on a priori principles, he
suggested, and the methods of normal science should be enough to settle any question that 
might arise. 

3 I am not going to criticize this suggestion in this paper, but I think that while this
methodological principle sounds very good, it has proven itself to be unfeasible in the real
world. That is, this principle amounts to the belief that there are methods of proving that
grammar Γ does, or does not, represent the psychologically real grammar of English, or some
other language, based no more on considerations of simplicity than any other laboratory 
science would use, and based on the same kinds of laboratory-based methods that any other 
science would use. This method has to be very reliable, because it serves as the epistemological 
bedrock on which the over-arching Universal Grammar will be based. This method, however, 
does not exist at the current time. I am not saying it could not exist; anyone is free to believe 
that, or not. But in 50 years of syntactic research, we have not developed a general set of 
tools for establishing whether a grammar has that property or not. But without a functioning 
version of this test, classical generative grammar cannot get off the ground. 


it fails to generate a large number of English sentences. The second [RANKING
CAN BE BASED ON GRAMMAR LENGTH] was that once we eliminate grammars
from consideration that do not generate the right data, we can rank them on the 
basis of some formal property of the grammars themselves, and in particular we 
should be able to develop a formalism in which program length provides that 
formal property: the shortest grammar will always be the one predicted to be 
the right one. 

There are two fatal flaws to this program, I regret to say: I regret it because 
I think it sounds marvelous, and I would have been happy to devote many years 
to its development and execution. But there are two fatal flaws nonetheless. 
The first was that the SMALL ROLE OF DATA assumption is untenable, and 
there were people, like Ray Solomonoff, who were working to demonstrate this 
at exactly the same moment in the 1950s. 

The second fatal flaw is a bit more complex, but it can be summarized as 
this: Chomsky’s actual method of theory development put a strong emphasis 
on developing the grammar-writing linguistic theory (which he eventually called 
Universal Grammar) right from the start, and this led to a situation in which 
it seemed like the linguist needed to pay a certain “cost” for adding complexity 
to the grammars of individual languages, but there was no “cost” for adding 
complexity to the Universal Grammar. This is a subtle point, and we can argue
about it at some length; and it will be at the core of most of what I want 
to present to you this morning. I will call this second flaw the UNIVERSAL 
GRAMMAR IS FREE GRAMMAR fallacy. But first I want to talk briefly about 
probabilistic grammars as Solomonoff was developing them. I am pretty sure 
that most of you know this material already (that is what makes you a special 
group), but it’s important to spell out the general program and its context. 


3 Solomonoff and the logic of confirmation 


Ray Solomonoff was in Cambridge, Mass., working on the problem of induction 
of generalizations on the basis of finite data, at the same time that Chomsky 
was working on the problem of justifying grammars—that is, during the mid 
1950s. Solomonoff had been an undergraduate at the University of Chicago, 
where he had studied with Rudolf Carnap, the most influential of the members 
of the Vienna Circle who had come to the United States. During this period 
(1930s through 1950s), there was a lot of discussion of the nature of the logic of 
confirmation. There were many puzzles in this area, of which perhaps the most 
famous was why it was or wasn’t the case that the observation of a white
iPod confirmed the statement that all ravens are black. Since “All ravens are 
black” certainly appears to be equivalent to “anything which is not black is not 
a raven,” and my white iPod seems to confirm that, why doesn’t observation of 
my white iPod confirm an enormous number of irrelevant generalizations? 

Without pretending to have shed any light on the question, it would not be 
unfair to say that this example brings home the idea that there is a difference 
between “real” confirmation of a generalization by a particular observation, on 
the one hand, and “simple” consistency of an observation with a generalization. 
I recently noticed a comment made by Chomsky in Language and Mind, 1968 
[4] that made it clear that he was well aware of this issue, or so it seems to me. 
On pp. 76-77, he observes: 


A third task is that of determining just what it means for a hypoth- 
esis about the generative grammar of a language to be “consistent” 
with the data of sense. Notice that it is a great oversimplification 
to suppose that a child must discover a generative grammar that accounts
for all the linguistic data that has been presented to him and
that “projects” such data to an infinite range of potential sound- 
meaning relations....The third subtask, then, is to study what we 
might think of as the problem of “confirmation” —in this context, 
the problem of what relation must hold between a potential gram- 
mar and a set of data for this grammar to be confirmed as the actual 
theory of the language in question. 


The bottom line regarding probabilistic models is that they provide an answer
to the question of how different grammars that generate the same language 
may be confirmed to different extents by the same set of observations. Each 
grammar assigns a finite amount of probability mass (in fact, it assigns exactly 
1.0 total units of probability mass) over the infinite set of predicted sentences, 
and each grammar will do that in a different way. We test the grammars by 
seeing what probability they assign to actual observed data, data that has been 
assembled in some fashion which is not biased with respect to intentionally 
producing data that pushes in favor of one person’s theory or another. We 
use the differing probabilities assigned by the different grammars to rank the 
grammars: all other things being equal, we prefer the grammar that assigns the 
most probability mass to data that had already independently been observed. 
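A toy case makes the ranking concrete. In the sketch below, two probabilistic grammars generate exactly the same language (the strings a, aa, aaa, ...) but spread their probability mass differently; both are geometric models, and the corpus and the two stop probabilities are invented for illustration.

```python
import math
from typing import Iterable


def log2_prob(stop: float, lengths: Iterable[int]) -> float:
    """Log2 probability of a corpus under a geometric model.

    A string of n a's has probability (1 - stop) ** (n - 1) * stop, so the
    probabilities of all strings a, aa, aaa, ... sum to exactly 1.
    """
    return sum((n - 1) * math.log2(1 - stop) + math.log2(stop) for n in lengths)


observed_lengths = [1, 2, 1, 3, 1, 2]   # invented corpus: a, aa, a, aaa, a, aa

# Two grammars for the same language, differing only in their stop probability.
scores = {
    "grammar_1": log2_prob(0.5, observed_lengths),
    "grammar_2": log2_prob(0.1, observed_lengths),
}
best = max(scores, key=scores.get)   # grammar assigning most mass to the data
print(scores, best)
```

Both grammars are consistent with every observation, yet the first assigns the corpus a probability of 2^-10 and the second a far smaller one, so the observations confirm them to different degrees.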

To make a long story short, and shamelessly blending together Solomonoff’s
work [9] with that of other people (notably Kolmogorov, Chaitin, and Rissanen
[8]; see [7]), we can conclude the following: we can naturally assign a probability
distribution over grammars, based on their code length in some appropriate
universal algorithmic language, such as that of a Universal Turing machine;
if such grammars are expressed in a binary encoding, and if they are
“self-terminating” (i.e., have the prefix property: there is no grammar g in the set
of grammars which is a prefix of some longer grammar g’), then we assign each
grammar g the probability $2^{-|g|}$, where $|g|$ is the length of g in bits.
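The role of the prefix property can be checked directly: when no encoding is a prefix of another, assigning each grammar a probability of 2 raised to minus its length in bits can never hand out more than one unit of probability mass (the Kraft inequality). The binary strings below are invented stand-ins for encoded grammars, not real grammars.

```python
def is_prefix_free(codes: list) -> bool:
    """True if no code in the list is a proper prefix of another code."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


def prior(code: str) -> float:
    """Prior probability 2 ** -(length in bits) for a binary-encoded grammar."""
    return 2.0 ** -len(code)


grammars = ["0", "10", "110", "111"]      # invented binary encodings
total = sum(prior(g) for g in grammars)   # Kraft inequality: total <= 1

print(is_prefix_free(grammars), total)
```

Here the four codes exhaust the probability mass exactly. A set such as `["0", "01"]` fails the prefix test, and for such sets the sum is no longer guaranteed to stay below one.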

We can draw this conclusion, then: 

If we accept the following assumptions: 


• Our grammars are probabilistic (they assign a distribution to the
representations they generate);

• The goal, or one major goal, of linguistics is to produce grammars;

• There is a natural prior distribution over algorithms, possibly modulo
some concern over choice of universal Turing machine or its equivalent;

then we can conclude:

• there is a natural formulation of the question, what is the best grammar,
given the data D? The answer is:

$$\arg\max_{g} \; pr(g)\, pr(D \mid g)$$

where

$$pr(g) = 2^{-|g|}$$


We can now describe the work of the linguist in an abstract and idealized
fashion as follows. She has a sample of data from a set of languages, $\mathcal{L}$. She has
a computer and a computer language in which she develops the best grammar
of each language individually. If she has only one language (English, say) in
her corpus, then she will look for the grammar which maximizes the probability
of the data by minimizing the description length of the data, which is to say,
minimizing the sum of the length of the grammar plus the negative log probability
of the data, given the grammar. She will have no motivation for conceptually
dividing her grammar of English into a part that is universal and a part that is
English-particular.
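The single-language search can be written as a small calculation. The sketch below assumes the description length of the data under a grammar is the grammar's length in bits plus the negative base-2 log probability it assigns to the data; the two candidate grammars and their numbers are invented.

```python
import math


def description_length(grammar_bits: float, prob_data_given_grammar: float) -> float:
    """Grammar length in bits plus the negative log2 probability of the data."""
    return grammar_bits - math.log2(prob_data_given_grammar)


# Invented candidates: name -> (grammar length in bits, probability of the data).
candidates = {
    "small_grammar": (100, 2.0 ** -400),
    "large_grammar": (250, 2.0 ** -200),
}

dl = {name: description_length(bits, p) for name, (bits, p) in candidates.items()}
best = min(dl, key=dl.get)
print(dl, best)
```

The larger grammar wins here, at 450 bits against 500, because the extra 150 bits of grammar buy a 200-bit improvement in how well the data is predicted. Minimizing this sum is the same as maximizing the prior 2^-|g| times the probability of the data, since that product is 2 raised to minus the description length.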

But suppose she (the linguist) is studying two languages, English and Japanese; 
English and Japanese have different structures, and probability must be assigned 
according to different models. Some parts of the model will be specifically 
set aside for treating sentences from English, some for treating sentences from 
Japanese, and other parts will be relevant for both. 

Grammar-writers’ premise: It is a reasonable goal for linguists to develop a 
Linguistic Theory which is a specific explicit way of writing grammars of any 
and all natural, human languages. This Linguistic Theory is in effect a higher- 
level computer language which, when given a complete grammar, can perform 
tasks that require knowledge of language, and only knowledge of language (like 
parsing, perhaps). 


4 Rissanen and There ain’t no such thing as a 
free UG 


I said before that there were two fatal flaws to the classical generative picture:
one was the assumption that the complexity of Universal Grammar
was cost-free for the scientist, the UNIVERSAL GRAMMAR IS FREE GRAMMAR
fallacy, and the other was its failure to deal with the relationship between
grammar and data (the SMALL ROLE OF DATA flaw). Classical generative
grammar operated as if there were a valid principle in effect that the less
information the linguist included in his language-particular grammar, the better
things were. This could be accomplished with or without increasing the
complexity of UG, both in theory and in reality. When a grammatical proposal did
not increase the complexity of UG but did simplify the grammatical description
of a language, then I think everyone would agree that the change constituted an
improvement. But quite often, a proposal was made to simplify the description
of a particular language (or two or five) by removing something that all these
grammars shared, and placing it not in the particular grammars, but in the
Universal Grammar common to all the grammars.

The UNIVERSAL GRAMMAR IS FREE GRAMMAR fallacy is the following 
assumption: while the complexity of a particular grammar counts against it as 
a scientific hypothesis, and is an indirect claim about the information that must 
be abstracted from the “training data” by the language learner, the complexity 
of Universal Grammar has no cost associated with it from a scientific point 
of view. Its complexity may be the result of millions of years of evolutionary
pressure—or not; the linguist neither knows nor cares. 

I call this a fallacy, because it inevitably leads to a bad and wrong result. 
There is one very good reason for insisting that the researcher must take into 
consideration the informational cost of whatever he postulates in his Universal 
Grammar. If he does not do so, then there is a strong motivation for moving 
a lot of the specific detail of individual grammars into Universal Grammar, 
going so far as to even include perfectly ridiculous kinds of information, like the 
dictionary of English. UG could contain a principle like, if the definite article in 
a language is the, then it is SVO and adjectives precede their head nouns. That 
would simplify the grammar of English, and if there is no cost to putting it in 
UG—if that does not decrease the prior probability of the entire analysis—then 
the rational linguist will indeed put that principle in UG. 4


5 The bigger picture 


The picture of linguistics that we have arrived at is this: we are all seeking 
a Universal Grammar, $UG_1$, which will run on an all-purpose computer, and
which will accept grammars and corpora, and produce grammatical analyses of 
the corpora. With this UG, we can do a good job (though not a perfect job) 
of finding the best grammar for each language for which we have data; we do 
this by selecting grammars for each language, such that the overall sum, over 
all the languages, of the length of the grammar plus the compressed length of 
the data is maximally compact (i.e., minimized). 

What could possibly go wrong in this beautiful picture? It would seem that 
this is the garden of paradise: if someone finds a way to improve the grammar 
of one language, while not making the grammar of the other languages worse, 
then we will be sure just by running the numbers, so to speak, on the computer. 
Different linguists may have different hunches, but we will be able to easily 
distinguish a hunch from an advance or breakthrough. 


4 Now, there is a natural way to understand this which is valid and convincing, in my
opinion at least. We can consider the goal and responsibility of the linguist to be to account for all
observed languages, and if we really can find some elements in all our grammars of languages
and then place them in a common subroutine, so to speak, changing nothing else, then we
will save on the overall length of our grammatical model of the universe: if we shift N bits
from individual grammars to universal grammar, and the cost of a pointer to this code is q
bits, and if there are L languages in the current state of linguistic research that have been
studied, then over-all we will save $(L - 1)N - Lq$ bits. Hence the prior probability of our
account of the world’s data will have just been multiplied by $2^{(L-1)N - Lq}$, which is no mean
feat.

Let’s push this a bit. What if out of our L different languages, half of them need to refer
to this function we have just shifted to Universal Grammar? Then each language must be
specified for whether it “contains” that material, and that will cost 1 bit (more or less) for
each language. Then moving this out of individual grammars and into UG will save us only
$(L/2 - 1)N - L - qL/2$ bits.

Generalizing this a bit, and assuming that $L_{yes}$ languages contained this material in their
grammar, and $L_{no}$ did not (so that $L = L_{yes} + L_{no}$), then shifting material from individual
grammars to UG will save us $(L_{yes} - 1)N - L_{yes}\log_2\frac{L}{L_{yes}} - L_{no}\log_2\frac{L}{L_{no}} - qL_{yes}$ bits. 6


It may not appear so at first blush, but there is some real news here for someone defending
this rationalist view of Universal Grammar. Such a person must find the least costly way to 
formulate any addition to Universal Grammar (that value is N), and in addition he must pay 
a price for every language in the world that does not need to refer to this aspect of Universal 
Grammar. Is that true? Or can the particular languages’ grammar just pass over it in silence? 
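The bit bookkeeping in footnote 4 is easy to check numerically. The sketch below takes the saving from moving N bits into UG to be (L_yes - 1)N, minus q bits of pointer for each language that uses the material, minus a per-language marking cost of L_yes log2(L/L_yes) + L_no log2(L/L_no) bits; the function names and example values are invented for illustration.

```python
import math


def flag_bits(k: int, total: int) -> float:
    """Marking cost for k languages out of total: k * log2(total / k), or 0 if k == 0."""
    return 0.0 if k == 0 else k * math.log2(total / k)


def savings_bits(l_yes: int, l_no: int, n: int, q: int) -> float:
    """Net bits saved by moving n bits of shared material into UG.

    l_yes languages use the material (each pays q bits for a pointer);
    l_no languages do not, but every language is marked for whether it does.
    """
    total = l_yes + l_no
    return (l_yes - 1) * n - flag_bits(l_yes, total) - flag_bits(l_no, total) - q * l_yes


print(savings_bits(10, 0, 1000, 8))   # every language uses the material
print(savings_bits(5, 5, 1000, 8))    # only half of them do
```

When every language uses the material the marking cost vanishes and the result reduces to (L - 1)N - Lq. When exactly half do, the marking cost is one bit per language, which reproduces the intermediate case.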


There is one thing wrong, however. We must worry about the possibility that 
there are two groups of linguists, using different underlying computers (different 
Universal Turing Machines, that is) who arrive at different conclusions. It is not 
enough to require that the linguist pay for the algorithmic cost of his Universal 
Grammar. That cost can always be shifted to the choice of universal Turing 
machine s/he employs. And every linguist will be motivated to find or design the 
Universal Turing Machine that incorporates as much as possible of the linguist’s 
Universal Grammar, making a mockery out of the requirement we have already 
discussed. What do we do now? 


6 The limits of conventionalism for UTMs 


We now arrive at what is nearly the final point of our journey. I imagine an 
almost perfect scientific linguistic world in which there is a competition between 
a certain number of groups of researchers, each particular group defined by 
sharing a general formal linguistic theory. The purpose of the community is to 
play a game by which the best general formal linguistic theory can be encouraged 
and identified. Who the winner is will probably change over time as theories 
change and develop. Let us say that there are N members (that is, N member 
groups). To be a member of this club, you must subscribe to the following (and 
let’s suppose you’re in the group numbered #1): 

1. You adopt an approved Universal Turing machine ($UTM_1$). I will explain
later how a person can propose a new Turing machine and get it approved. But
at the beginning, let’s just assume that there is a set of approved UTMs, and
each group must adopt one. I will index different UTMs with integers or with
superscript lower-case Greek letters, like $UTM^{\alpha}$: that is a particular approved
universal Turing machine; the set of such machines that have already been
approved is $\mathcal{U}$.

2. You adopt a set of corpora which constitutes the data for various lan- 
guages; everyone in the group must adopt all approved corpora. Any member 
can propose new corpora for new or old languages, and any members can chal- 
lenge already proposed corpora. At any given moment, there is an approved and 
accepted set of data. The set of languages we consider is $\mathcal{L}$, and l is a variable
indexing over that set $\mathcal{L}$; the corpora form a set C, and the corpus for language
l is $C_l$.

3. The activities involved in this competition are the following. You will 
have access to the data C, and you will select a Universal Turing Machine of 
your choice, you will come up with a Universal Grammar UG, and a set of 
grammars $\{\Gamma_l\}$, one for each language. Again, a bit more slowly:


1. You will write a Universal Grammar which runs on the UTM that you
have adopted, $UTM_1$ in your case, let’s say. There will be a broadness
condition on the Universal Grammars that a group proposes, which I
will return to in a moment. You will compute the length of your $UG_1$ as
it runs as a compiler on your UTM. This is a positive real number, noted
$|UG_1|_{UTM_1}$. More on what that means also, below.


2. You will write a probabilistic grammar for each language $l \in \mathcal{L}$, and these
grammars are written in the formal language required by your Universal
Grammar. We will indicate a grammar by an upper case gamma, $\Gamma$; a
particular grammar is $\Gamma_l$. Every Universal Grammar must have the ability
to calculate the length of any grammar proposed by any group, not just
grammars that its own group proposes, and this length must be finite
(this is a “broadness” condition). But different Universal Grammars will
allow some statements to be made in a very concise and economical way,
whereas other UGs might require hundreds or thousands of bits to express
the same thing.


3. You will use the probabilistic grammars to extract redundancies from the 
corpora, and then 


4. You will calculate two quantities: (a) the length of the Universal Grammar 
UG, and (b) the length of the linguistic analyses, which I will call the 
“empirical term”: that is, the lengths of all of the individual language 
grammars plus the compressed lengths of the corpora for all the languages. 
Symbolically, we can express the length of the linguistic analyses with
an “empirical term” for a given triple, consisting of: $UTM_\alpha$; a set of
grammars $\{\Gamma_l\}$; and the length of the unexplained information in each
corpus, as in (1):


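$$\mathrm{Emp}(UTM_\alpha, \{\Gamma_l\}, \mathcal{C}) = \sum_{l \in \mathcal{L}} \Big( |\Gamma_l|_{UTM_\alpha} - \log \mathrm{pr}_{\Gamma_l}(C_l) \Big) \qquad (1)$$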


This group is trying to minimize the quantity: 


$$|UG|_{UTM_\alpha} + \mathrm{Emp}(UTM_\alpha, \{\Gamma_l\}, \mathcal{C}) \qquad (2)$$


which is essentially the minimal description length of the data, given $UTM_\alpha$ (in your case, $UTM_1$).

Another way to put this is that we are doing standard MDL analysis, but 
restricting our consideration to the class of models where we know explicitly how 
to divide the models for each language into a universal part and a language- 
particular part. This is the essential ingredient for playing the intellectual game 
that we call theoretical linguistics. 
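
A minimal sketch of the bookkeeping behind (1) and (2) may help. It rests on assumptions that are mine rather than part of the proposal: lengths are measured in bits, the logarithm is taken base 2, and a grammar assigns a probability to a corpus as the product of independent sentence probabilities. The function names and the toy numbers are hypothetical.

```python
import math


def corpus_cost_bits(sentence_probs):
    """Compressed length of a corpus: -log2 of the probability the grammar assigns to it.

    Assumption: sentences are independent, so the corpus probability is a product
    and its negative log is a sum.
    """
    return -sum(math.log2(p) for p in sentence_probs)


def empirical_term(grammar_bits, corpus_probs):
    """Sum over languages of grammar length plus compressed corpus length, as in (1)."""
    return sum(
        grammar_bits[lang] + corpus_cost_bits(corpus_probs[lang])
        for lang in grammar_bits
    )


def total_description_length(ug_bits, grammar_bits, corpus_probs):
    """Length of the Universal Grammar plus the empirical term, as in (2)."""
    return ug_bits + empirical_term(grammar_bits, corpus_probs)


# Toy, made-up numbers: grammar lengths in bits and per-sentence probabilities.
grammar_bits = {"language_a": 1200.0, "language_b": 1500.0}
corpus_probs = {"language_a": [0.5, 0.25], "language_b": [0.125]}

emp = empirical_term(grammar_bits, corpus_probs)
total = total_description_length(300.0, grammar_bits, corpus_probs)
print(emp, total)
```

All three lengths here are understood relative to one fixed UTM; changing the UTM changes every number that goes into the sum.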

We might then imagine you win the competition if you can demonstrate that
the final sum that you calculate in (2) is the smallest among all the groups. But
this won’t work correctly, because it is perfectly possible (indeed, I think it is
unavoidable 7) that two competitors will each find that their own systems are
better than their competitors’ systems, because of the UTM that they use to
find the minimum. That is, suppose we are talking about two groups, Group 1
and Group 2, which utilize $UTM_\alpha$ and $UTM_\beta$.

It is perfectly possible (indeed, it is natural) to find that 


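$$|UG_i|_{UTM_\alpha} + \mathrm{Emp}(UTM_\alpha, \{\Gamma_l\}^i, \mathcal{C}) < |UG_j|_{UTM_\alpha} + \mathrm{Emp}(UTM_\alpha, \{\Gamma_l\}^j, \mathcal{C}) \qquad (3)$$

and yet, for a value of $\beta$ different from $\alpha$:

$$|UG_i|_{UTM_\beta} + \mathrm{Emp}(UTM_\beta, \{\Gamma_l\}^i, \mathcal{C}) > |UG_j|_{UTM_\beta} + \mathrm{Emp}(UTM_\beta, \{\Gamma_l\}^j, \mathcal{C}) \qquad (4)$$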


7 If a group wants to win the competition as I have defined it so far, they can modify their
UTM to make the Universal Grammar arbitrarily small.
This is because each group has a vested interest in developing a UTM which
makes their Universal Grammar extremely small. This is just a twist, just a 
variant, on the problem described in the UNIVERSAL GRAMMAR IS FREE fallacy 
that I discussed above. 
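
The reversal described by (3) and (4) can be made concrete with a toy table. The numbers below are invented purely for illustration; each entry stands for one group's total, that is, the length of its Universal Grammar plus its empirical term, measured on a given UTM. The names `totals` and `winner` are hypothetical.

```python
# Hypothetical totals in bits, keyed by (UTM, group). Invented for illustration only.
totals = {
    ("alpha", "i"): 5000,
    ("alpha", "j"): 5400,
    ("beta", "i"): 5600,
    ("beta", "j"): 5100,
}


def winner(utm, totals):
    """Return the group with the smallest total description length on the given UTM."""
    groups = sorted(g for (u, g) in totals if u == utm)
    return min(groups, key=lambda g: totals[(utm, g)])


# Group i wins when lengths are measured on UTM alpha, as in (3) ...
print(winner("alpha", totals))
# ... but group j wins when the same analyses are measured on UTM beta, as in (4).
print(winner("beta", totals))
```

Nothing in the table says which UTM is the right yardstick, which is exactly the problem taken up in the next section.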


7 Which Turing machine? The cheapest one 


With all of this bad news about the difficulty of choosing a universally accepted
Universal Turing machine, how can we play this game fairly? Here is my sug- 
gestion: 

The general problem is this: we have a set of N approved UTMs. A UTM 
is, by definition, a machine which can be programmed to emulate any Turing 
machine. Let’s denote an emulator that makes machine $i$ into a simulator of
machine $j$ as $\{i \Rightarrow j\}$, and the length of that emulator as $[i \Rightarrow j]$. We set a
requirement that for each $UTM_i$ ($i \le N$) in the group, there is a set of smallest
emulators $\{i \Rightarrow j\}$ (in a weak sense: these are the shortest ones found so far).

When a group wants to introduce a new UTM, $UTM_{N+1}$, to the group, they
must produce emulators $\{i \Rightarrow N+1\}$ (for all $i \le N$) and $\{N+1 \Rightarrow i\}$ (for
all $i \le N$). But anyone who can find a shorter emulator can present it to the
group’s archives at any point, to replace the older, longer emulator (that is, you
do not have to be the original proposer of either $i$ or $j$ in order to submit a
better, i.e., shorter, emulator $\{i \Rightarrow j\}$ for the community).
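
The archive just described can be sketched as a small data structure. This is only an illustration under my own assumptions: the class name `EmulatorArchive`, its method names, and the idea of recording a length in bits for each ordered pair of machines are hypothetical, not part of any existing system.

```python
class EmulatorArchive:
    """Keeps, for each ordered pair (i, j), the shortest emulator found so far
    that makes machine i into a simulator of machine j."""

    def __init__(self):
        self.best = {}  # (i, j) -> length in bits of the shortest known emulator

    def submit(self, i, j, length_bits):
        """Record an emulator; anyone may submit. Returns True if it is a new best."""
        key = (i, j)
        if key not in self.best or length_bits < self.best[key]:
            self.best[key] = length_bits
            return True
        return False

    def length(self, i, j):
        """Length of the shortest known emulator from machine i to machine j."""
        return self.best[(i, j)]

    def can_admit(self, new, existing):
        """A new UTM is admissible only once emulators exist in both directions
        between it and every already-approved UTM."""
        return all(
            (i, new) in self.best and (new, i) in self.best for i in existing
        )


archive = EmulatorArchive()
first = archive.submit(1, 2, 500)    # first emulator making machine 1 simulate machine 2
better = archive.submit(1, 2, 450)   # a shorter one replaces it
worse = archive.submit(1, 2, 480)    # a longer one is ignored
archive.submit(2, 1, 520)

admissible_before = archive.can_admit(3, [1, 2])
for i in (1, 2):
    archive.submit(i, 3, 700)
    archive.submit(3, i, 650)
admissible_after = archive.can_admit(3, [1, 2])
```

The lengths are "smallest" only in the weak sense of the text: the archive stores the shortest emulator submitted so far, not a provably minimal one.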

All of the infrastructure is set up now. What do we need to do to pick the 
best UTM? What we want to avoid is the case of a tricky linguist who develops 
a UTM (let’s call it UTM*) which makes his or her UG unreasonably short, 
but that property should be revealed by the fact that the emulators for
UTM* are all long. That is, the rational choice is to pick the $UTM_\alpha$ which
satisfies the condition:

$$\arg\min_{\alpha} \sum_{i} [i \Rightarrow \alpha] \qquad (5)$$


A $UTM_\alpha$ which is unfair is one which cannot be easily reproduced by other
UTMs, which is the point of the choice function proposed here. 
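
The choice function in (5) can be sketched in a few lines. The emulator lengths below are invented, and I assume (this is not stated in the text) that a machine's emulation of itself is left out of the sum. The function name `cheapest_utm` is hypothetical.

```python
def cheapest_utm(lengths, utms):
    """Pick the UTM whose emulators, as written on all the other UTMs, are shortest in total.

    lengths[(i, a)] is the length in bits of the shortest known emulator that
    makes machine i into a simulator of machine a.
    Assumption: the term with i == a is excluded from the sum.
    """
    def cost(a):
        return sum(lengths[(i, a)] for i in utms if i != a)

    return min(utms, key=cost)


# Invented lengths. Machine 3 plays the part of the "tricky" UTM*: it is
# expensive for the other machines to emulate.
lengths = {
    (1, 2): 100, (2, 1): 120,
    (1, 3): 900, (2, 3): 950,
    (3, 1): 80, (3, 2): 90,
}
utms = [1, 2, 3]

# Total cost of emulating each candidate on the others: 1 -> 200, 2 -> 190, 3 -> 1850.
chosen = cheapest_utm(lengths, utms)
print(chosen)
```

With these numbers machine 3 is never chosen, however short it makes its designer's Universal Grammar, because every other machine needs a long emulator to reproduce it.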


8 Conclusions 


It is helpful, and maybe even important, to recognize that generative grammar 
has gone through three very different stages—at least Chomsky’s version has. 
In its first stage, the one I find appealing, it was a theory in line with the 
traditional American understanding of what linguistics is as a science: it is a 
traditional science of linguistic data. In the second period, beginning roughly 
with Aspects of the Theory of Syntax in 1965 [3], generative grammar was taken 
to be an abstract theory of human cognition, in some fairly strong sense (though 
in my opinion, a sense that was never adequately made clear). In the third 
period, starting around 2000, generative grammar has tried to package itself as 
an abstract theory of biology—presumably because psychologists made it clear 
that they did not see a way to acknowledge generative grammar as a theory of 
cognitive psychology. 
I have suggested returning to the original conception of generative grammar,
but modified quite a bit in response to what we have learned in the last 50 years 
about learning and computation. I do think that the theory that we would 
develop along the lines I have sketched is a theory of mind, but it is at a 
somewhat higher level of abstraction than that expected by most non-linguists 
who look for an account of what language is. Is it psychologically real? I would 
rather say No, but I would say that what we are aiming for is the correct mental 
model of human language. 